{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Binary Classification with the UCI Credit-card Default Dataset\n",
    "_**Mitigating disparities in false-positive rates and false-negative rates**_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "1. [What is Covered](#What-is-Covered)\n",
    "1. [Introduction](#Introduction)\n",
    "1. [The UCI Credit-card Default Dataset](#The-UCI-Credit-card-Default-Dataset)\n",
    "1. [Using a Fairness Unaware Model](#Using-a-Fairness-Unaware-Model)\n",
    "1. [Mitigating Equalized Odds Difference with Postprocessing](#Mitigating-Equalized-Odds-Difference-with-Postprocessing)\n",
    "1. [Mitigating Equalized Odds Difference with GridSearch](#Mitigating-Equalized-Odds-Difference-with-GridSearch)\n",
    "1. [Conclusion](#Conclusion)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is Covered\n",
    "\n",
    "* **Domain:**\n",
    "  * Finance (loan decisions). The data is semisynthetic to create a simple example of disparities in FPR and FNR.\n",
    "\n",
    "* **ML task:**\n",
    "  * Binary classification.\n",
    "\n",
    "* **Fairness tasks:**\n",
    "  * Assessment of unfairness using Fairlearn metrics.\n",
    "  * Mitigation of unfairness using Fairlearn mitigation algorithms.\n",
    "\n",
    "* **Performance metrics:**\n",
    "  * Area under ROC curve.\n",
    "  * Balanced accuracy.\n",
    "\n",
    "* **Fairness metrics:**\n",
    "  * False-positive rate difference.\n",
    "  * False-negative rate difference.\n",
    "  * Equalized-odds difference.\n",
    "\n",
    "* **Mitigation algorithms:**\n",
    "  * `fairlearn.reductions.GridSearch`\n",
    "  * `fairlearn.postprocessing.ThresholdOptimizer`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "In this notebook, we consider a scenario where algorithmic tools are deployed to predict the likelihood that an applicant will default on a credit-card loan. The notebook emulates the problem presented in this [white paper](https://www.microsoft.com/en-us/research/uploads/prod/2020/09/Fairlearn-EY_WhitePaper-2020-09-22.pdf) in collaboration with EY.\n",
    "\n",
    "Due to data privacy, we do not use the data from the white paper. Instead, we use the [UCI Credit-card default dataset](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients), a toy dataset reflecting credit-card defaults in Taiwan, as a substitute dataset to replicate the desired workflow. To make this dataset applicable to our problem, we introduce a synthethic feature that is highly predictive for applicants defined as \"female\" in terms of the \"sex\" feature, but is uninformative for applicants defined as \"male\".\n",
    "\n",
    "We train a fairness-unaware algorithm on this dataset and show the model has a higher false-positive rate as well as a higher false-negative rate for the \"male\" group than for the \"female\" group. We then use Fairlearn to mitigate this disparity using both the `ThresholdOptimizer` and `GridSearch` algorithms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General imports\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# Data processing\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Models\n",
    "import lightgbm as lgb\n",
    "from sklearn.calibration import CalibratedClassifierCV\n",
    "\n",
    "# Fairlearn algorithms and utils\n",
    "from fairlearn.postprocessing import ThresholdOptimizer\n",
    "from fairlearn.reductions import GridSearch, EqualizedOdds\n",
    "\n",
    "# Metrics\n",
    "from fairlearn.metrics import (\n",
    "    MetricFrame,\n",
    "    selection_rate, demographic_parity_difference, demographic_parity_ratio,\n",
    "    false_positive_rate, false_negative_rate,\n",
    "    false_positive_rate_difference, false_negative_rate_difference,\n",
    "    equalized_odds_difference)\n",
    "from sklearn.metrics import balanced_accuracy_score, roc_auc_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The UCI Credit-card Default Dataset\n",
    "\n",
    "The UCI dataset contains data on 30,000 clients and their credit card transactions at a bank in Taiwan. In addition to static client features, the dataset contains the history of credit card bill payments between April and September 2005, as well as the balance limit of the client's credit card. The target is whether the client will default on a card payment in the following month, October 2005. A model trained on this data could be used, in part, to determine whether a client is eligible for another loan or a credit increase."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the data\n",
    "data_url = \"http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls\"\n",
    "dataset = pd.read_excel(io=data_url, header=1).drop(columns=['ID']).rename(columns={'PAY_0':'PAY_1'})\n",
    "dataset.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dataset columns:\n",
    "\n",
    "* `LIMIT_BAL`: credit card limit, will be replaced by a synthetic feature\n",
    "* `SEX, EDUCATION, MARRIAGE, AGE`: client demographic features\n",
    "* `BILL_AMT[1-6]`: amount on bill statement for April-September\n",
    "* `PAY_AMT[1-6]`: payment amount for April-September\n",
    "* `default payment next month`: target, whether the client defaulted the following month"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract the sensitive feature\n",
    "A = dataset[\"SEX\"]\n",
    "A_str = A.map({ 2:\"female\", 1:\"male\"})\n",
    "# Extract the target\n",
    "Y = dataset[\"default payment next month\"]\n",
    "categorical_features = ['EDUCATION', 'MARRIAGE','PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']\n",
    "for col in categorical_features:\n",
    "    dataset[col] = dataset[col].astype('category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Introduce a Synthetic Feature\n",
    "\n",
    "We manipulate the balance-limit feature `LIMIT_BAL` to make it highly predictive for the \"female\" group but not for the \"male\" group. Specifically, we set this up, so that a lower credit limit indicates that a female client is less likely to default, but provides no information on a male client's probability of default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dist_scale = 0.5\n",
    "np.random.seed(12345)\n",
    "# Make 'LIMIT_BAL' informative of the target\n",
    "dataset['LIMIT_BAL'] = Y + np.random.normal(scale=dist_scale, size=dataset.shape[0])\n",
    "# But then make it uninformative for the male clients\n",
    "dataset.loc[A==1, 'LIMIT_BAL'] = np.random.normal(scale=dist_scale, size=dataset[A==1].shape[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)\n",
    "# Plot distribution of LIMIT_BAL for men\n",
    "dataset['LIMIT_BAL'][(A==1) & (Y==0)].plot(kind='kde', label=\"Payment on time\", ax=ax1, \n",
    "                                           title=\"LIMIT_BAL distribution for \\\"male\\\" group\")\n",
    "dataset['LIMIT_BAL'][(A==1) & (Y==1)].plot(kind='kde', label=\"Default\", ax=ax1)\n",
    "# Plot distribution of LIMIT_BAL for women\n",
    "dataset['LIMIT_BAL'][(A==2) & (Y==0)].plot(kind='kde', label=\"Payment on time\", ax=ax2, \n",
    "                                           legend=True, title=\"LIMIT_BAL distribution for \\\"female\\\" group\")\n",
    "dataset['LIMIT_BAL'][(A==2) & (Y==1)].plot(kind='kde', label=\"Default\", ax=ax2, \n",
    "                                           legend=True).legend(bbox_to_anchor=(1.6, 1))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note in the above figures that the new `LIMIT_BAL` feature is indeed highly predictive for the \"female\" group, but not for the \"male\" group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train-test split\n",
    "df_train, df_test, Y_train, Y_test, A_train, A_test, A_str_train, A_str_test = train_test_split(\n",
    "    dataset.drop(columns=['SEX', 'default payment next month']), \n",
    "    Y, \n",
    "    A, \n",
    "    A_str,\n",
    "    test_size = 0.3, \n",
    "    random_state=12345,\n",
    "    stratify=Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using a Fairness Unaware Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We train an out-of-the-box `lightgbm` model on the modified data and assess several fairness metrics. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lgb_params = {\n",
    "    'objective' : 'binary',\n",
    "    'metric' : 'auc',\n",
    "    'learning_rate': 0.03,\n",
    "    'num_leaves' : 10,\n",
    "    'max_depth' : 3\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = lgb.LGBMClassifier(**lgb_params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(df_train, Y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scores on test set\n",
    "test_scores = model.predict_proba(df_test)[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train AUC\n",
    "roc_auc_score(Y_train, model.predict_proba(df_train)[:, 1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Predictions (0 or 1) on test set\n",
    "test_preds = (test_scores >= np.mean(Y_train)) * 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# LightGBM feature importance \n",
    "lgb.plot_importance(model, height=0.6, title=\"Features importance (LightGBM)\", importance_type=\"gain\", max_num_features=15) \n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We notice that the synthetic feature `LIMIT_BAL` appears as the most important feature in this model although it has no predictive power for an entire demographic segment in the data.\n",
    "\n",
    "We next use Fairlearn's `MetricFrame` to examine the the two different kinds of errors (false positives and false negatives) on the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mf = MetricFrame({\n",
    "    'FPR': false_positive_rate,\n",
    "    'FNR': false_negative_rate},\n",
    "    Y_test, test_preds, sensitive_features=A_str_test)\n",
    "\n",
    "mf.by_group"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that both kinds of errors are more common in the \"male\" group than in the \"female\" group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper functions\n",
    "def get_metrics_df(models_dict, y_true, group):\n",
    "    metrics_dict = {\n",
    "        \"Overall selection rate\": (\n",
    "            lambda x: selection_rate(y_true, x), True),\n",
    "        \"Demographic parity difference\": (\n",
    "            lambda x: demographic_parity_difference(y_true, x, sensitive_features=group), True),\n",
    "        \"Demographic parity ratio\": (\n",
    "            lambda x: demographic_parity_ratio(y_true, x, sensitive_features=group), True),\n",
    "        \"------\": (lambda x: \"\", True),\n",
    "        \"Overall balanced error rate\": (\n",
    "            lambda x: 1-balanced_accuracy_score(y_true, x), True),\n",
    "        \"Balanced error rate difference\": (\n",
    "            lambda x: MetricFrame(metrics=balanced_accuracy_score, y_true=y_true, y_pred=x, sensitive_features=group).difference(method='between_groups'), True),\n",
    "        \" ------\": (lambda x: \"\", True),\n",
    "        \"False positive rate difference\": (\n",
    "            lambda x: false_positive_rate_difference(y_true, x, sensitive_features=group), True),\n",
    "        \"False negative rate difference\": (\n",
    "            lambda x: false_negative_rate_difference(y_true, x, sensitive_features=group), True),\n",
    "        \"Equalized odds difference\": (\n",
    "            lambda x: equalized_odds_difference(y_true, x, sensitive_features=group), True),\n",
    "        \"  ------\": (lambda x: \"\", True),\n",
    "        \"Overall AUC\": (\n",
    "            lambda x: roc_auc_score(y_true, x), False),\n",
    "        \"AUC difference\": (\n",
    "            lambda x: MetricFrame(metrics=roc_auc_score, y_true=y_true, y_pred=x, sensitive_features=group).difference(method='between_groups'), False),\n",
    "    }\n",
    "    df_dict = {}\n",
    "    for metric_name, (metric_func, use_preds) in metrics_dict.items():\n",
    "        df_dict[metric_name] = [metric_func(preds) if use_preds else metric_func(scores) \n",
    "                                for model_name, (preds, scores) in models_dict.items()]\n",
    "    return pd.DataFrame.from_dict(df_dict, orient=\"index\", columns=models_dict.keys())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We calculate several performance and fairness metrics below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Metrics\n",
    "models_dict = {\"Unmitigated\": (test_preds, test_scores)}\n",
    "get_metrics_df(models_dict, Y_test, A_str_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the overall performance metric we use the _area under ROC curve_ (AUC), which is suited to classification problems with a large imbalance between positive and negative examples. For binary classifiers, this is the same as _balanced accuracy_.\n",
    "\n",
    "As the fairness metric we use *equalized odds difference*, which quantifies the disparity in accuracy experienced by different demographics. Our goal is to assure that neither of the two groups (\"male\" vs \"female\") has substantially larger false-positive rates or false-negative rates than the other group. The equalized odds difference is equal to the larger of the following two numbers: (1) the difference between false-positive rates of the two groups, (2) the difference between false-negative rates of the two groups.\n",
    "\n",
    "The table above shows the overall AUC of 0.85 (based on continuous predictions) and the overall balanced error rate of 0.22 (based on 0/1 predictions). Both of these are satisfactory in our application context. However, there is a large disparity in accuracy rates (as indicated by the balanced error rate difference) and even larger when we consider the equalized-odds difference. As a sanity check, we also show the demographic parity ratio, whose level (slightly above 0.8) is considered satisfactory in this context."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mitigating Equalized Odds Difference with Postprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We attempt to mitigate the disparities in the `lightgbm` predictions using the Fairlearn postprocessing algorithm `ThresholdOptimizer`. This algorithm finds a suitable threshold for the scores (class probabilities) produced by the `lightgbm` model by optimizing the accuracy rate under the constraint that the equalized odds difference (on training data) is zero. Since our goal is to optimize balanced accuracy, we resample the training data to have the same number of positive and negative examples. This means that `ThresholdOptimizer` is effectively optimizing balanced accuracy on the original data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "postprocess_est = ThresholdOptimizer(\n",
    "    estimator=model,\n",
    "    constraints=\"equalized_odds\",\n",
    "    prefit=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Balanced data set is obtained by sampling the same number of points from the majority class (Y=0)\n",
    "# as there are points in the minority class (Y=1)\n",
    "balanced_idx1 = df_train[Y_train==1].index\n",
    "pp_train_idx = balanced_idx1.union(Y_train[Y_train==0].sample(n=balanced_idx1.size, random_state=1234).index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train_balanced = df_train.loc[pp_train_idx, :]\n",
    "Y_train_balanced = Y_train.loc[pp_train_idx]\n",
    "A_train_balanced = A_train.loc[pp_train_idx]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "postprocess_est.fit(df_train_balanced, Y_train_balanced, sensitive_features=A_train_balanced)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "postprocess_preds = postprocess_est.predict(df_test, sensitive_features=A_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "models_dict = {\"Unmitigated\": (test_preds, test_scores),\n",
    "              \"ThresholdOptimizer\": (postprocess_preds, postprocess_preds)}\n",
    "get_metrics_df(models_dict, Y_test, A_str_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `ThresholdOptimizer` algorithm significantly reduces the disparity according to multiple metrics. However, the performance metrics (balanced error rate as well as AUC) get worse. Before deploying such a model in practice, it would be important to examine in more detail why we observe such a sharp trade-off. In our case it is because the available features are much less informative for one of the demographic groups than for the other.\n",
    "\n",
    "Note that unlike the unmitigated model, `ThresholdOptimizer` produces 0/1 predictions, so its balanced error rate difference is equal to the AUC difference, and its overall balanced error rate is equal to 1 - overall AUC."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mitigating Equalized Odds Difference with GridSearch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now attempt to mitigate disparities using the `GridSearch` algorithm. Unlike `ThresholdOptimizer`, the predictors produced by `GridSearch` do not access the sensitive feature at test time. Also, rather than training a single model, we train multiple models corresponding to different trade-off points between the performance metric (balanced accuracy) and fairness metric (equalized odds difference)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Train GridSearch\n",
    "sweep = GridSearch(model,\n",
    "                   constraints=EqualizedOdds(),\n",
    "                   grid_size=50,\n",
    "                   grid_limit=3)\n",
    "\n",
    "sweep.fit(df_train_balanced, Y_train_balanced, sensitive_features=A_train_balanced)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sweep_preds = [predictor.predict(df_test) for predictor in sweep.predictors_] \n",
    "sweep_scores = [predictor.predict_proba(df_test)[:, 1] for predictor in sweep.predictors_] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "equalized_odds_sweep = [\n",
    "    equalized_odds_difference(Y_test, preds, sensitive_features=A_str_test)\n",
    "    for preds in sweep_preds\n",
    "]\n",
    "balanced_accuracy_sweep = [balanced_accuracy_score(Y_test, preds) for preds in sweep_preds]\n",
    "auc_sweep = [roc_auc_score(Y_test, scores) for scores in sweep_scores]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select only non-dominated models (with respect to balanced accuracy and equalized odds difference)\n",
    "all_results = pd.DataFrame(\n",
    "    {\"predictor\": sweep.predictors_, \"accuracy\": balanced_accuracy_sweep, \"disparity\": equalized_odds_sweep}\n",
    ") \n",
    "non_dominated = [] \n",
    "for row in all_results.itertuples(): \n",
    "    accuracy_for_lower_or_eq_disparity = all_results[\"accuracy\"][all_results[\"disparity\"] <= row.disparity] \n",
    "    if row.accuracy >= accuracy_for_lower_or_eq_disparity.max(): \n",
    "        non_dominated.append(True)\n",
    "    else:\n",
    "        non_dominated.append(False)\n",
    "\n",
    "equalized_odds_sweep_non_dominated = np.asarray(equalized_odds_sweep)[non_dominated]\n",
    "balanced_accuracy_non_dominated = np.asarray(balanced_accuracy_sweep)[non_dominated]\n",
    "auc_non_dominated = np.asarray(auc_sweep)[non_dominated]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot equalized odds difference vs balanced accuracy\n",
    "plt.scatter(balanced_accuracy_non_dominated, equalized_odds_sweep_non_dominated, label=\"GridSearch Models\")\n",
    "plt.scatter(balanced_accuracy_score(Y_test, test_preds),\n",
    "            equalized_odds_difference(Y_test, test_preds, sensitive_features=A_str_test), \n",
    "            label=\"Unmitigated Model\")\n",
    "plt.scatter(balanced_accuracy_score(Y_test, postprocess_preds), \n",
    "            equalized_odds_difference(Y_test, postprocess_preds, sensitive_features=A_str_test),\n",
    "            label=\"ThresholdOptimizer Model\")\n",
    "plt.xlabel(\"Balanced Accuracy\")\n",
    "plt.ylabel(\"Equalized Odds Difference\")\n",
    "plt.legend(bbox_to_anchor=(1.55, 1))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As intended, `GridSearch` models appear along the trade-off curve between the large balanced accuracy (but also large disparity), and low disparity (but worse balanced accuracy). This gives the data scientist a flexibility to select a model that fits the application context best."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot equalized odds difference vs AUC\n",
    "plt.scatter(auc_non_dominated, equalized_odds_sweep_non_dominated, label=\"GridSearch Models\")\n",
    "plt.scatter(roc_auc_score(Y_test, test_scores),\n",
    "            equalized_odds_difference(Y_test, test_preds, sensitive_features=A_str_test), \n",
    "            label=\"Unmitigated Model\")\n",
    "plt.scatter(roc_auc_score(Y_test, postprocess_preds), \n",
    "            equalized_odds_difference(Y_test, postprocess_preds, sensitive_features=A_str_test),\n",
    "            label=\"ThresholdOptimizer Model\")\n",
    "plt.xlabel(\"AUC\")\n",
    "plt.ylabel(\"Equalized Odds Difference\")\n",
    "plt.legend(bbox_to_anchor=(1.55, 1))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, `GridSearch` models appear along the trade-off curve between AUC and equalized odds difference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare GridSearch models with low values of equalized odds difference with the previously constructed models\n",
    "grid_search_dict = {\"GridSearch_{}\".format(i): (sweep_preds[i], sweep_scores[i])\n",
    "                    for i in range(len(sweep_preds))\n",
    "                    if non_dominated[i] and equalized_odds_sweep[i]<0.1}\n",
    "models_dict.update(grid_search_dict)\n",
    "get_metrics_df(models_dict, Y_test, A_str_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we explored how a fairness-unaware gradient boosted trees model performed on the classification task in contrast to the postprocessed `ThresholdOptimizer` model and the `GridSearch` model. The `ThresholdOptimizer` greatly reduced the disparity in performance across multiple fairness metrics. However the overall error rate and AUC for the `ThresholdOptimizer` model were worse compared to the fairness-unaware model. \n",
    "\n",
    "With the `GridSearch` algorithm, we trained multiple models that balance the trade-off between the balanced accuracy and the equalized odds fairness metric. After engaging with relevant stakeholders, the data scientist can deploy the model that balances the performance-fairness trade-off that meets the needs of the business."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
